Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mpileup speedup #2

Open
wants to merge 242 commits into
base: develop
Choose a base branch
from
Open

Mpileup speedup #2

wants to merge 242 commits into from

Conversation

jkbonfield
Copy link

Speed ups to bcftools mpileup -Ou file.bam (ie simplest mode, default options, no ref and uncompressed BCF output).

  • The get_position() function now caches the read length so it doesn't have to scan through the entire cigar string for each and every base it operates on. This was O(N^2) complexity.

    WARNING: it does this by shoehorning it into an unused field in BAM.
    TODO: bite the bullet and break the ABI so we can put something into pileup1_t instead. I'll leave this up to you to thrash out. Probably fix the ghastly "aux" name too while at it. :-)

    The speed gain is HUGE on long low-accuracy reads.

  • Removal of some floating point divisions in bcf_call_glfgen(). We had things like baseQ/60.0 * nqual. Given the size of baseq and nqual we can do (baseQ*nqual) / 60 in integer instead.

    I tried changing the epos division too but it produces different results. This is actually due to a bug in the existing code with rounding errors. E.g. the integer math got a correct 58 while the floating point match got 57.9999999 and rounded down to 57.

    If you're happy to have different results, consider changing the epos calc to: int epos = ((int64_t)pos*bca->npos)/(len+1);. Instead I moved the initial division earlier up the function to give it time for the result to be computed before we use it.

  • The Mann Whitney 1947 function is now precomputed for all the values potentially used in this code. See mw.h. Anything outside the bounds is computed on the fly as before, just incase I missed something.

  • Calc_mwu_bias() main loop now has specific code for dealing with one or both of the a[] and b[] arrays being zero. This seems to be a significant speed gain given the sparsity of these arrays.

  • Lots of "TRACK_EXTENTS" bits; see the #define in bcf2bam.h.
    Basically this keeps track of the maximum values filled out in the bca->{ref,alt}_{epos,ibq,imq} arrays. It then uses these to shortcut some of the whole-array calculations as once it hits the runs of zeros the results don't change. The speed difference is very slight though. I leave it there for you to test, but the ifdefs show clearly how to cull it if you deem it not worth the extra complexity.

  • We no longer query the RG field over and over again for each base in the read. There's nowhere to put the value though! So in a complete hack I stashed it in the bam insert size. I don't recommend keeping this as is, but it works for now until we figure out the correct form of the new ABI/API. This is simply demonstration that caching such things is a significant win.

Other potential improvements:

  • The main loop is still column by column and within each column then seq by seq. We have many cases where the same pileup array occurs for many columns, but we always throw it all away and recompute. All we really need to do is a memmove left 1 byte.

    Similarly when finding the sample groupings, we do this over and over again despite the fact they're not changing.

    Fixing this is a larger restructuring that I didn't want to do myself. It is left as an exercise to the reader, but there are potentially huge speed gains to be had here.

On an Illumina data set (10 million reads), the timings went from 21m15s to 12m33s (approx 70% faster). On a PacBio data set the time dropped from 5m17s to 0m13s!

pd3 and others added 30 commits June 13, 2015 06:17
- rewritten for greater experimental flexibility
- the AA peak can be included in the fit (the -i option)
- the minimum fraction of aberrant cells as a command line parameter (-m)
- control the minimum cn4 bump size (-b)
This was hidden as it was/is experimental, but with the licensing
additions, users are explicitly requesting the command by compiling
with USE_GPL=1, so we might as well display in the help message.

Closes samtools#280
updates to polysomy command for publication
updates to roh command for publication
bcf_hdr_combine deprecated and replaced by bcf_hdr_merge in htslib
(2bb9370f5a24938d8a2dc56f404e584661bf413f).

Fixes samtools#208
- CNx state removed
- automatic optimization of parameters to increase sensitivity when
only a fraction of cells is aberrant (at the cose of decreased specificity)
- LRR smoothing
pd3 and others added 27 commits June 28, 2016 18:03
Not functional yet. Just copying over the files.

* bam_plcmd.c for the mpileup command
* bam2bcf.[ch] bam2bcf_indel.c for the VCF/BCF creation
* sample.[ch] for RG:SM handling
minimal changes to files copied from samtools in order to compile
in bcftools.

* use regidx from htslib rather than bedidx from samtools
* remove sam_opts calls as sam_opts.h not copied from samtools

Todo:

* copy over relevant functionality from sam_opts.h
* remove text based mpileup output
* update options and defaults
* bring over `---gvcf` and other changes from @pd3 fork
* deprecate `-g -v -u` options (still functional, but with warning)
* exit with message to use `samtools mpileup` if `-s/--output-MQ` used
* `-O` option was `--output-BP` and is not `--output-type` for consistency
  with other bcftools commands. If `[buzv]` not given as an option
  will warn

These catches for old text output options are probably not necessary
as users may not expect text output from `bcftools mpileup`.
* mpileup.1.out, mpileup.2.out, mpileup.3.out and mpileup.4.out
  are from samtools with mpileup.1.out and mpileup.3.out converted
  from the text output
* mpileup.5.out a new test with the newer AD, etc annotations
* sam/bam/cram test files all stored. perhaps there is some
  way to store one version and convert within the test ala
  the vcf-miniview in samtools?
adds to @4e7c8fb86349761fed1b290357dbc792222ecdcb
* remove deprecated `-g`, `-v`, `-u`, `-D`, `-V`, `-S`
* remove `-R` short option to make way for `--regions-file` option later
This commit brings over the `--gvcf` functionality from @pd3's branch,
consisting of relevant bits from cf3219c
and ee8210d

Reference only blocks will be merged into gVCF blocks when the
minimum per-sample depth falls in the intervals defined by the
argument to the `-gvcf` option. Documentaion added to explain
the merging and a test added.
pulling over of cf5c354
adding in `-S,--samples-file` option and exiting if no samples are
read from the file or list

TODO: add exclude logic with `^` prefix as in other bcftools commands

removed `config.h` from `sample.c` as leftover this is in samtools,
but not bcftools at the moment
switch `-t/--output-tags` option to `-a/--annotate` to make room for
the `-t/--targets` option

available annotations are now listed on request with `-a ?` rather than
cluttering up the help output.
This is meant as a temporary change while we extend the
regidx api, but allow bcftools code to use these changes
before they appear in some form in htslib.

This commit does not add new features, just copies over
`regidx.[ch]` and rejiggers the linking to use these
local bcftools copies.

the `*_c` are removed due to relying on `hts_internal.h`
(see fc9aeb6f77668afed412119701c5c58b0fca8091)
* added functions to loop over all regions
* lazy index build in case random access is not required
* support for chromosome names only, beg-end coordinates not mandatory
* set cap at maximum coordinate at 2147483647, hts_itr does not support larger
* tab and reg parsers will throw on finding a `0` to catch user
  error of using 0-based rather than 1-based coords
* `-r/--region` replaced by `-r/--regions` which will accept a comma separated
  list of regions as in other bcftools commands. `--region` still accepted
* `-R/--regions-file` option added to read regions from a file

This commit lifts over work originally done in cfd7cf9

Note: when more than one region is given, all indices are stored in memory,
which can be a problem when running on many bams. An alternative would be
to cache pre-filled `hts_itr`'s for each region.

Resolves samtools#369
…ools commands

the point of `--no-version` is to remove invocation specify metadata in the header
lines for pipeline systems that are tracking this separately. we are outputting
the `##reference` line though in mpileup. could drop this as well when
`--no-version` used. seems silly to add a separate option.
* prefix with ^ to negate the selection
* assign/rename samples by providing second field:
    RG_ID_1  SAMPLE_A
    RG_ID_2  SAMPLE_A
    RG_ID_3  SAMPLE_B
* on read group name conflict give the alignment file,
  asterisk for all reads in the file:
    RG_ID_1  FILE_1.bam  SAMPLE_A
    RG_ID_2  FILE_2.bam  SAMPLE_A
    *        FILE_3.bam  SAMPLE_C

Resolves 4th item in
samtools#414 (comment)
and samtools/samtools#324.
Our first foray into exploiting this is to cache the
bam_smpl_get_sample_id return value.  We compute this once in the
constructor (the first time we see a new bam1_t) instead of for every
pileup location in group_smpl.

TO DO: Group_smpl itself could now become distributed perhaps.  Rather
than an N*M loop clustering all sample IDs together, each new bam
could be added to sample struct on first appearance and removed when
it goes out of sight.

The slight caveat preventing this from being implemented immediately
is that the constructor/destructors are called for every BAM
overlapping the region rather than every filtered base that ultimately
ends up in a pileup.  Indeed sometimes we get constructors for
reads entirely filtered out.
Mann Whitney test now uses a precomputed table instead of continually
calculating the same values many times over.

calc_mwu_bias now has short-cuts to compute the result faster when one
or both of a[] and b[] hold zero values.
pd3 pushed a commit that referenced this pull request Feb 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants